Our ability to collect “big data” has greatly surpassed our capability to analyze
it, underscoring the emergence of the fourth paradigm of science, which is data-driven
discovery. The need for data informatics is also emphasized by the Materials
Genome Initiative (MGI), further boosting the emerging field of materials
informatics. In this article, we look at how data-driven techniques are playing
a big role in deciphering processing-structure-property-performance relationships
in materials, with illustrative examples of both forward models (property prediction)
and inverse models (materials discovery). Such analytics can significantly
reduce time-to-insight and accelerate cost-effective materials discovery, which is
the goal of MGI. © 2016 Author(s). All article content, except where otherwise
noted, is licensed under a Creative Commons Attribution (CC BY) license
(http://creativecommons.org/licenses/by/4.0/). [http://dx.doi.org/10.1063/1.4946894]


INTRODUCTION 

The field of materials science relies on experiments and simulation-based models to understand
the “physics” of different materials, in order to better characterize them and
discover new materials with improved properties for use in society at all levels. Lately, the “big
data”  generated  by  such  experiments  and  simulations  has  offered  unprecedented  opportunities  for 
application  of  data-driven  techniques  in  this  field,  thereby  opening  up  new  avenues  for  accelerated 
materials  discovery  and  design.  The  need  for  such  data  analytics  has  also  been  emphasized  by  the 
Materials  Genome  Initiative  (MGI),1  which  envisions  the  discovery,  development,  manufacturing, 
and  deployment  of  advanced  materials  twice  as  fast  and  at  a  fraction  of  the  cost. 

Four  paradigms  of  science 

In  fact,  these  developments  in  the  field  of  materials  science  are  along  the  lines  of  how  science 
and  technology  overall  have  evolved  over  the  centuries.  For  thousands  of  years,  science  was  purely 
empirical,  which  here  corresponds  to  metallurgical  observations  over  the  “ages”  (stone,  bronze, 
iron,  steel).  Then  came  the  paradigm  of  theoretical  models  and  generalizations  a  few  centuries  ago, 
characterized by the formulation of various “laws” in the form of mathematical equations; in materials
science, the laws of thermodynamics are a good example. But for many scientific problems, the
theoretical models became too complex with time, and an analytical solution was no longer feasible.
With the advent of computers a few decades ago, a third paradigm of computational science became
very popular. This has allowed simulations of complex real-world phenomena based on the theoretical
models of the second paradigm, and excellent examples of this in materials science are
density functional theory (DFT) and molecular dynamics (MD) simulations. These paradigms of
science have contributed in turn towards advancing the previous paradigms, and today these are
popular as the branches of theory, experiment, and computation in almost all scientific domains. The
amount of data being generated by these experiments and simulations has given rise to the fourth
paradigm of science2 over the last few years, which is (big) data driven science, and it unifies the
first three paradigms of theory, experiment, and computation/simulation. It is increasingly becoming
popular in the field of materials science as well and has, in fact, led to the emergence of the new
field of materials informatics. Figure 1 shows these four paradigms of science.

FIG. 1. The four paradigms of science: empirical, theoretical, computational, and data-driven.

Big  data 

Before  going  further,  it  would  be  useful  to  expand  on  the  concept  of  big  data,  and  what  it 
means  specifically  in  the  context  of  materials  science.  The  “bigness”  (amount)  of  data  is  certainly 
the  primary  feature  and  challenge,  but  several  other  characteristics  can  make  the  collection,  storage, 
retrieval,  analysis,  and  visualization  of  such  data  even  more  challenging.  For  example,  the  data  may 
be  from  heterogeneous  sources,  may  be  of  different  types,  may  have  unknown  dependencies  and 
inconsistencies  within  it,  may  have  parts  that  are  missing  or  not  reliable,  may  be  generated  at  a 
rate  that  could  be  much  greater  than  what  traditional  systems  can  handle,  may  have  privacy  issues, 
and  so  on.  These  issues  can  be  summarized  by  the  various  Vs  associated  with  big  data — volume, 
velocity,  variety,  variability,  veracity,  value,  and  visualization.  Of  these,  the  first  three  (volume, 
velocity,  and  variety)  are  specific  to  big  data  and  others  are  features  of  any  data,  including  big 
data.  Further,  each  application  domain  can  also  introduce  its  own  nuances  to  the  process  of  big 
data  management  and  analytics.  It  is  worth  noting  that  in  many  areas  of  materials  science,  until 
recently,  there  was  more  of  a  no  data  than  a  big  data  problem,  in  the  sense  that  open,  accessible 
data  have  been  rather  limited;  however,  recent  MGI-supported  efforts3-4  and  other  similar  efforts 
around  the  world  are  promoting  the  availability  and  accessibility  of  digital  data  in  materials  science. 
These  efforts  include  combining  experimental  and  simulation  data  into  a  searchable  materials  data 
infrastructure  and  encouraging  researchers  to  make  their  data  available  to  the  community.  Thanks 
to such efforts, it is fair to say that the sheer complexity and variety of the materials science data
becoming available nowadays require the development of new big data approaches in materials
informatics. For example, there are numerous kinds of experimental and simulation-based materials
property data (e.g., physical, chemical, electronic, thermodynamic, mechanical, structural),
engineering/processing data (e.g., heat treatment), image data (e.g., electron backscatter diffraction),
spatio-temporal data (e.g., tomography, structure evolution), unstructured textual data (materials
science literature), and so on. Of course, several of these types of data are often coupled together.
Several excellent review articles on big data in materials science are available in the literature.3-7

Processing-structure-property-performance  (PSPP)  relationships 

So what can materials informatics and big data do for a real-world materials science application?
It is well known that almost everything in materials science hinges on PSPP
relationships,8 which are far from being well understood. Figure 2 shows these PSPP relationships,
where the deductive science relationships of cause and effect flow from left to right, and the
inductive engineering relationships of goals and means flow from right to left. Further, it is important
to note that each relationship from left to right is many-to-one, and consequently the ones
from right to left are one-to-many. Thus, many processing routes can potentially result in the
same structure of the material, and the same property of a material could potentially be achieved
by multiple structures. Each experimental observation or simulation can be thought of as a data
point for a forward model (e.g., a measurement or calculation of a property given the processing,
composition, and structure parameters). A database of such data points can be used with a materials
informatics approach such as predictive analytics to build data-driven forward models that can run
in a very small fraction of the time it takes to do the experiment or simulation. This acceleration
of forward models can not only help to guide future simulations and experiments, but also make it
possible to realize the inverse models, which are much more challenging and critical for materials
discovery and design. The construction of inverse models is typically formulated as an optimization
problem wherein a property or performance metric of interest is to be maximized
or minimized, subject to the various constraints on the representation of the material, which is
typically in the form of a composition- and/or structure-based function. The optimization process
usually involves multiple invocations of the forward model, and thus having a fast forward model is
extremely valuable. Further, since these inverse relationships are one-to-many, a good inverse model
should be able to identify multiple optimal solutions (if they exist), so as to have the flexibility to
select the material structure that can be attained in the easiest and most cost-effective manner.

FIG. 2. The processing-structure-property-performance relationships of materials science and engineering, and how materials
informatics approaches can help decipher these relationships via forward and inverse models.
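The inverse-model formulation described above can be illustrated with a minimal sketch. It assumes a toy, hypothetical forward model with a single structure parameter x in [0, 1] and two equally good optima; the function names, the grid resolution, and the tolerances are illustrative choices and are not taken from any of the cited studies. The point is only that the inverse step calls the forward model many times and should report every distinct optimum rather than just one.

```python
def forward_model(x):
    """Hypothetical fast forward model: maps one structure parameter x in [0, 1]
    to a property value. It has two equally good maxima, at x = 0.2 and x = 0.8."""
    return -((x - 0.2) ** 2) * ((x - 0.8) ** 2)


def inverse_design(forward, n_grid=1001, tol=1e-8, min_gap=0.05):
    """Scan the feasible range, calling the forward model once per candidate,
    and return the best property value with all distinct near-optimal candidates."""
    candidates = [i / (n_grid - 1) for i in range(n_grid)]
    scored = [(forward(x), x) for x in candidates]
    best = max(value for value, _ in scored)
    near_optimal = sorted(x for value, x in scored if best - value <= tol)

    # Merge neighbouring grid points so that each optimum is reported only once.
    distinct = []
    for x in near_optimal:
        if not distinct or x - distinct[-1] > min_gap:
            distinct.append(x)
    return best, distinct


best_value, solutions = inverse_design(forward_model)
print(best_value, solutions)
```

An exhaustive scan like this is only feasible because the toy search space is one-dimensional; the number of forward-model calls grows with the size of the candidate set, which is why a fast forward model matters.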

In  the  rest  of  this  article,  we  shall  first  discuss  a  generic  workflow  for  conducting  materials 
informatics,  and  then  illustrate  it  with  examples  of  some  recent  advances  in  this  field  in  terms  of 
development  and  application  of  big  data  analytics  approaches  for  building  both  forward  and  inverse 
models for PSPP relationships. In particular, we will take the example of steel fatigue prediction using
a set of experimental data (forward models),9 predicting the stability of a compound using DFT
simulation data (forward models) and subsequent discovery of stable ternary compounds (inverse
models),10 and structure-property optimization of a magnetoelastic material (inverse models).11


KNOWLEDGE  DISCOVERY  WORKFLOW  FOR  MATERIALS  INFORMATICS 

Figure 3 depicts a typical end-to-end workflow for materials informatics. A variety of raw
materials data as discussed earlier could be stored in heterogeneous materials databases in potentially
different data formats. Given a task at hand (say developing property prediction models), the
first step is to understand the data format and representation, and to do any preprocessing
necessary to ensure the quality of the data before any modeling, i.e., to remove or appropriately deal with
noise, outliers, missing values, duplicate data instances, etc. Usually the instances and/or attributes
responsible for these are removed if they are easily identifiable and sufficient data are available,
but optimally utilizing incomplete data remains an active area of research. Examples of data preprocessing
steps include discretization, sampling, normalization, attribute-type conversion, feature
extraction, feature selection, etc. Such data preprocessing can be either supervised or unsupervised,
depending on whether the process uses the target attribute (here the property of the material to be
predicted); the two are thus usually considered separate stages in the workflow.

FIG. 3. The knowledge discovery workflow for materials informatics. The overall goal is to mine heterogeneous materials
databases and extract actionable PSPP linkages to enable data-driven materials discovery and design.
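A few of the unsupervised preprocessing steps named above (removing instances with missing values, removing duplicate instances, and normalization) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the attribute names `carbon_wt_pct` and `temper_C`, and the choice of min-max scaling are hypothetical and are not prescribed by the workflow itself.

```python
def preprocess(records, numeric_keys):
    """Unsupervised cleaning: drop incomplete instances, drop exact duplicates,
    then min-max normalize each numeric attribute to the range [0, 1]."""
    seen, clean = set(), []
    for rec in records:
        if any(rec.get(k) is None for k in numeric_keys):
            continue                      # missing value: remove the instance
        key = tuple(rec[k] for k in numeric_keys)
        if key in seen:
            continue                      # duplicate instance: keep the first copy only
        seen.add(key)
        clean.append(dict(rec))           # copy so the raw data stay untouched

    if not clean:
        return clean

    for k in numeric_keys:
        lo = min(r[k] for r in clean)
        hi = max(r[k] for r in clean)
        span = hi - lo
        for r in clean:
            r[k] = (r[k] - lo) / span if span else 0.0
    return clean


# Made-up example records (values are illustrative, not measured data).
raw = [
    {"carbon_wt_pct": 0.20, "temper_C": 550.0},
    {"carbon_wt_pct": 0.20, "temper_C": 550.0},   # duplicate
    {"carbon_wt_pct": 0.45, "temper_C": None},    # missing value
    {"carbon_wt_pct": 0.40, "temper_C": 650.0},
    {"carbon_wt_pct": 0.30, "temper_C": 600.0},
]
cleaned = preprocess(raw, ["carbon_wt_pct", "temper_C"])
print(cleaned)
```

None of these steps looks at the property to be predicted, which is what makes them unsupervised; a step such as ranking features by their correlation with the target would be supervised and would have to be treated differently during evaluation.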

Once  appropriate  data  preprocessing  has  been  performed  and  the  data  are  ready  for  modeling, 
one  can  employ  supervised  data  mining  techniques  for  predictive  modeling.  Caution  needs  to  be 
exercised  here  to  appropriately  split  the  data  into  training  and  testing  sets  (or  use  cross  validation), 
otherwise the model may overfit without this being detected, and show over-optimistic accuracy. If the target attribute
is numeric (e.g., fatigue strength, formation energy), regression techniques can be used for predictive
modeling, and if it is categorical (e.g., whether a compound is metallic or not), classification
techniques can be used. Some techniques are capable of doing both classification and regression.
There also exist several ensemble learning techniques that combine the results from base learners
in different ways, and have been shown to improve the accuracy and robustness of the model in some cases.
Table I lists some of the popular predictive modeling techniques. Apart from predictive modeling,
one can also use other data mining techniques such as clustering and relationship mining depending
on the goal of the project, for instance, to group similar materials or to discover hidden patterns
and associations in the data.
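As a minimal sketch of a hold-out split and of a technique that handles both target types, the example below uses a one-nearest-neighbor predictor for a categorical target and for a numeric target. The data are synthetic, and the single input feature, the noise level, the 75/25 split, and all function names are assumptions made for illustration only.

```python
import random


def nearest_neighbor_predict(train, x):
    """Return the target of the training instance whose input is closest to x (1-NN)."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test instances whose categorical label is predicted correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == label for x, label in test)
    return hits / len(test)


def mean_abs_error(train, test):
    """Mean absolute error of the 1-NN prediction for a numeric target."""
    return sum(abs(nearest_neighbor_predict(train, x) - y) for x, y in test) / len(test)


# Synthetic data: one input feature, a noisy numeric target, and a derived label.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
numeric = [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in xs]
categorical = [(x, "metallic" if y > 1.0 else "non-metallic") for x, y in numeric]

# Hold out the last 25% of the (already randomly ordered) instances for testing.
split = int(0.75 * len(xs))
train_c, test_c = categorical[:split], categorical[split:]
train_r, test_r = numeric[:split], numeric[split:]

train_accuracy = accuracy(train_c, train_c)   # evaluated on data the model has seen
test_accuracy = accuracy(train_c, test_c)     # evaluated on held-out data
test_mae = mean_abs_error(train_r, test_r)    # same technique used for regression
print(train_accuracy, test_accuracy, test_mae)
```

Because a nearest-neighbor model stores its training data, the accuracy measured on the training set is perfect here and says nothing about generalization; only the held-out figures are meaningful, and they are typically lower.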

TABLE I. Popular predictive modeling algorithms.

Modeling technique               Capability      Brief description
Naive Bayes12                    Classification  A probabilistic classifier based on Bayes’ theorem
Bayesian network13               Classification  A graphical model that encodes probabilistic conditional relationships among variables
Logistic regression14            Classification  Fits data to a sigmoidal S-shaped logistic curve
Linear regression15              Regression      A linear least-squares fit of the data w.r.t. input features
Nearest-neighbor16               Both            Uses the most similar instance in the training data for making predictions
Artificial neural networks17,18  Both            Uses hidden layer(s) of neurons to connect inputs and outputs; edge weights are learnt using back propagation
Support vector machines19        Both            Based on structural risk minimization; constructs hyperplanes in a multidimensional feature space
Decision table20                 Both            Constructs rules involving different combinations of attributes
Decision stump21                 Both            A weak tree-based machine learning model consisting of a single-level decision tree
J48 (C4.5) decision tree22       Classification  A decision tree model that identifies the splitting attribute based on information gain/gini impurity
Alternating decision tree23      Classification  Tree consists of alternating prediction nodes and decision nodes; an instance traverses all applicable paths
Logistic model tree24,25         Classification  A classification tree with logistic regression functions at the leaves
M5 model tree26,27               Regression      A tree with linear regression functions at the leaves
Random tree                      Both            Considers a randomly chosen subset of attributes
Reduced error pruning tree21     Both            Builds a tree using information gain/variance and prunes it using reduced-error pruning to avoid over-fitting
AdaBoost28                       Ensembling      Boosting can significantly reduce the error rate of a weak learning algorithm
Bagging29                        Ensembling      Builds multiple models on bootstrapped training data subsets to improve model stability by reducing variance
Random subspace30                Ensembling      Constructs multiple trees systematically by pseudo-randomly selecting subsets of features
Random forest31                  Ensembling      An ensemble of multiple random trees
Rotation forest32                Ensembling      Generates model ensembles based on feature extraction followed by axis rotations

Proper evaluation of data-driven models is crucial. A data-driven model can, in principle,
“memorize” every single instance of the dataset and thus achieve 100% accuracy on the same
data, but it will most likely not work well on unseen data. For this reason, advanced
data-driven techniques that usually result in black-box models need to be evaluated on data that
the model has not seen during training. A simple way to do this is to build the model on only
part of the data, and use the remainder for evaluation. This can be generalized to k-fold cross-validation,
where the dataset is randomly split into k parts, k - 1 parts are used to build the model
and the remaining part is used for testing, and the process is repeated k times with different
test splits. Cross-validation is a standard evaluation setting for guarding against over-optimistic
accuracy estimates caused by over-fitting.
Of course, k-fold cross-validation necessitates building k models, which may take a long time
on large datasets. It is also important to note that cross-validation is supposed to be of the entire
workflow and not just of the predictive model. Hence, any supervised data preprocessing should
also be considered along with predictive modeling while performing cross-validation in order to
get unbiased estimates of accuracy. Quantitative assessments of the model’s predictive accuracy
can be done with several classification/regression performance metrics such as accuracy, precision,
recall/sensitivity, specificity, area under the receiver operating characteristic (ROC) curve, coefficient
of correlation (R), explained variance (R2), Mean Absolute Error (MAE), Root Mean Squared
Error (RMSE), Standard Deviation of Error (SDE), etc.
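The requirement that cross-validation cover the entire workflow can be sketched as follows. In this minimal example the supervised preprocessing step (picking the feature most correlated with the target) is repeated inside every fold, using the training part only, before a least-squares line is fitted. The synthetic data, the single-feature selection rule, and all function names are assumptions made for illustration; they are not the procedure of any cited study.

```python
import math
import random


def pearson(a, b):
    """Coefficient of correlation (R) between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def fit_workflow(X, y):
    """Supervised feature selection plus a least-squares line, using training data only."""
    best = max(range(len(X[0])), key=lambda j: abs(pearson([r[j] for r in X], y)))
    col = [r[best] for r in X]
    mx, my = sum(col) / len(col), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(col, y)) / sum((a - mx) ** 2 for a in col)
    intercept = my - slope * mx
    return lambda row: intercept + slope * row[best]


def k_fold_predictions(X, y, k, seed=0):
    """Return one out-of-fold prediction per instance from k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(X)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_workflow([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = model(X[i])
    return preds


def regression_metrics(y_true, y_pred):
    """MAE, RMSE, R2 (explained variance), and R for a set of predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
        "R": pearson(y_true, y_pred),
    }


# Synthetic data: five candidate features, only the third one drives the target.
rng = random.Random(1)
X = [[rng.random() for _ in range(5)] for _ in range(100)]
y = [3.0 * row[2] + rng.gauss(0.0, 0.1) for row in X]

predictions = k_fold_predictions(X, y, k=10)
metrics = regression_metrics(y, predictions)
print(metrics)
```

Selecting the feature once on the full dataset and cross-validating only the line fit would let information from the test folds leak into the workflow, which is the bias the text warns about. Setting k equal to the number of instances turns the same routine into leave-one-out cross-validation.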

The resulting knowledge from this workflow can be represented in the form of invertible PSPP
relationships, thereby facilitating materials discovery and design. This workflow can be used and
leveraged at different stages by various stakeholders such as experimentalists, computational scientists,
and materials informaticians. For example, one can add new data as they are generated, get
summary statistics, identify key factors influencing a material property, find materials similar to a
given material in the database, develop forward predictive models, use them to predict properties of new
materials, and finally develop inverse models to discover materials with a desired property or set of
properties.

A lot of research has sprung up over the last decade utilizing data mining on materials science
data,9-11,33-41 and all of these studies can be said to use some flavor of the above-described materials
informatics workflow. It is also worth noting that this workflow is essentially a materials science
adaptation of existing similar workflows of data-driven analytics in other domains, as most of the
advanced techniques for big data management and informatics come from the field of computer
science and more specifically high-performance data mining,42-50 via applications in many different
domains like business and marketing,51-53 healthcare,54-60 climate science,61-63 bioinformatics,64-68
and social media analytics,69-71 among many others.


ILLUSTRATIVE  EXAMPLES  OF  MATERIALS  INFORMATICS 
Steel  fatigue  prediction 

Agrawal et al.9 used data from Japan’s National Institute for Materials Science (NIMS) MatNavi
database72 to build predictive models for the fatigue strength of steel. Accurate prediction of the
fatigue strength of steels is important for several advanced technology applications due to the
extremely high cost and time of fatigue testing and the potentially disastrous consequences of fatigue
failures. In fact, fatigue is known to account for more than 90% of all mechanical failures of
structural components.73 The NIMS data included composition and processing attributes of 371
carbon and low-alloy steels, 48 carburizing steels, and 18 spring steels. The materials informatics
approach used consisted of a series of steps that included data preprocessing for consistency using
domain knowledge, ranking-based feature selection, predictive modeling, and model evaluation
using leave-one-out cross-validation (a special case of cross-validation where k = N, the number of
instances in the dataset; it is generally used with small datasets for a more robust evaluation) with
respect to various metrics for prediction accuracy. Twelve regression-based predictive modeling
techniques were evaluated, and many of them were able to achieve a high predictive accuracy, with
R2 values ~0.98 and error rates <4%, outperforming the only prior study on fatigue strength prediction,
which reported R2 values of <0.94. In particular, neural networks, decision trees, and multivariate
polynomial regression were found to achieve a high R2 value of >0.97. It is well known in the field
of  predictive  data  analytics  that  after  a  certain  point  it  becomes  more  and  more  difficult  to  increase 
prediction  accuracy.  In  other  words,  an  improvement  from  0.94  to  0.97  should  be  considered  more 
significant  than  an  improvement  from  0.64  to  0.67.  It  was  also  observed  from  the  scatter  plots 
that  the  three  grades  of  steels  were  well  separated  in  most  cases,  and  different  techniques  tended 
to  perform  better  in  different  regions,  which  was  also  reflected  in  the  distribution  of  errors.  For 
example,  multivariate  polynomial  regression  gave  the  best  R2  of  0.9801  but  also  gave  very  poor 
predictions  for  some  carbon  and  low  alloy  steels.  It  is  encouraging  to  see  good  accuracy  despite 
the  limited  amount  of  available  experimental  data;  however,  it  was  also  concluded  that  the  methods 
resulting  in  bimodal  distribution  of  errors  or  the  ones  with  significant  peaks  in  higher  error  regions 
need  to  be  used  with  caution  even  though  their  reported  R2  may  be  high.  Nonetheless,  this  work 
successfully demonstrated the use of data science methodologies to build forward models on experimental
data connecting processing and composition directly with performance, in the context of
PSPP relationships.
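One way to make the comparison between the two R2 improvements concrete, assuming R2 is read as explained variance as defined earlier in this article, is to look at the fraction of the previously unexplained variance that each improvement removes. This framing is offered as an illustration and is not taken from the cited study:

```latex
\[
\frac{(1 - 0.94) - (1 - 0.97)}{1 - 0.94} = \frac{0.06 - 0.03}{0.06} = 50\%,
\qquad
\frac{(1 - 0.64) - (1 - 0.67)}{1 - 0.64} = \frac{0.36 - 0.33}{0.36} \approx 8.3\%.
\]
```

Under this reading, moving from 0.94 to 0.97 halves the unexplained variance, whereas moving from 0.64 to 0.67 removes only about one twelfth of it.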

Stable  compound  discovery 

In  another  work,  Meredig,  Agrawal  et  al.10  employed  a  similar  predictive  analytics  framework 
on  simulation  data  from  quantum  mechanical  DFT  calculations.  Performing  a  DFT  simulation  on 
a  material  requires  its  composition  and  crystal  structure  as  input.  Traditionally,  there  has  been  a  lot 
of  effort  in  the  field  of  computational  materials  science  on  crystal  structure  prediction  (CSP),  which 
takes  the  composition  as  input  and  performs  a  global  optimization  of  all  possible  unit  cells  to  find 
the  one  with  lowest  formation  energy.  Thus,  both  CSP  and  DFT  are  computationally  expensive, 
and  this  work  used  a  database  of  existing  DFT  calculations  to  build  forward  models  predicting 
the  property  of  a  material  (in  this  case  formation  energy)  using  a  variety  of  attributes  all  of  which 
could be derived solely based on the material’s stoichiometric composition. Of course, the predictive
models  are  trained  on  DFT  data,  which  implicitly  use  structure  information,  but  once  the  model 
is  built,  it  can  subsequently  be  used  to  predict  the  formation  energy  of  new  materials  without 
crystal  structure  as  input.  It  was  found  that  the  developed  models  were  excellent  at  predicting  DFT 
formation  energies  to  which  they  were  not  fit,  with  R2  score  greater  than  0.9,  and  MAE  well  within 
DFT’s  typical  agreement  with  experiment.  Thus,  these  forward  models  could  predict  formation 
energy without any structural input, about six orders of magnitude faster than DFT, and without
sacrificing the accuracy of DFT compared to experiment. These models were subsequently used to
conduct “virtual combinatorial chemistry” by scanning almost the entire unexplored ternary space
of compounds of the form AxByCz, to identify compounds that are predicted to be stable. This
is tantamount to the inverse models described earlier in this article. Here A, B, C were selected
from a list of 83 technologically relevant elements, and x, y, z were obtained using an enumeration
procedure taking into account the statistically most common compositions in the Inorganic Crystal
Structure Database (ICSD) and basic charge balance conditions. A total of approximately 1.6 × 10^6
ternary compositions were scanned to get predictions of formation energy in a matter of minutes, in
contrast to the tens of thousands of years that would be required for DFT simulations to do the same.
Many interesting insights were obtained as a result of this study, which identified about 4500 ternary
compositions as predictions for new stable compounds. Of these, 9 were systematically selected for
a full DFT crystal structure test, and 8 of them were explicitly confirmed to be stable. Interestingly,
if the 4500 predictions reported in this study are experimentally confirmed, it would represent an
increase in the total number of known stable ternary compounds by more than 10%. This realization
of inverse models is depicted in Figure 4.

FIG. 4. A simple realization of the inverse models for PSPP relationships. The forward predictive model built using a
supervised learning technique on a labeled materials dataset can be used to scan a combinatorial set of materials and thus
convert this set to a ranked list, ordered by the predicted property. This can be followed by one or more screening steps
to select and validate the predictions using simulation and/or experiments, thereby enabling data-driven materials discovery,
which can in turn be fed back into the materials dataset to derive improved models, and so on. Blue arrows denote the forward
model construction process, and green arrows denote the materials discovery process via inverse models.
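The scan-and-rank procedure described above can be sketched in a few lines. Everything specific in this example is a stand-in: the five-element list, the stoichiometry list, the per-element scores, the averaging "model", and the screening threshold are all hypothetical and only mimic the shape of the procedure (enumerate ternaries, predict with a fast forward model, rank, shortlist). The actual study used 83 elements, an ICSD-informed enumeration with charge balance, and models trained on DFT data.

```python
from itertools import combinations

ELEMENTS = ["Li", "Fe", "Ga", "O", "S"]                     # toy list for illustration
STOICHIOMETRIES = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (1, 1, 3)]

# Made-up per-element scores standing in for composition-derived attributes.
TOY_SCORE = {"Li": -0.9, "Fe": -0.2, "Ga": -0.4, "O": -1.5, "S": -0.7}


def predict_formation_energy(composition):
    """Hypothetical forward model: composition-weighted average of the toy scores.
    A real model would be trained on a database of DFT formation energies."""
    total = sum(composition.values())
    return sum(TOY_SCORE[el] * n for el, n in composition.items()) / total


def enumerate_ternaries(elements, stoichiometries):
    """Yield candidate compositions AxByCz as {element: count} dictionaries."""
    for a, b, c in combinations(elements, 3):
        for x, y, z in stoichiometries:
            yield {a: x, b: y, c: z}


def screen(elements, stoichiometries, threshold):
    """Rank all candidates by predicted energy and keep those below the threshold."""
    scored = [(predict_formation_energy(c), c)
              for c in enumerate_ternaries(elements, stoichiometries)]
    ranked = sorted(scored, key=lambda item: item[0])       # most negative first
    return [(energy, comp) for energy, comp in ranked if energy < threshold]


# The threshold is an arbitrary cut-off chosen for this toy example.
shortlist = screen(ELEMENTS, STOICHIOMETRIES, threshold=-1.0)
for energy, comp in shortlist[:3]:
    print(round(energy, 3), comp)
```

The shortlisted candidates would then go to a more expensive validation step (simulation and/or experiment), and confirmed results could be added back to the training data.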

Data-driven  microstructure  optimization 

The last example that we would like to discuss is the recent work on microstructure optimization
of a magnetoelastic Fe-Ga alloy (Galfenol) for enhanced elastic, plastic,
and magnetostrictive properties.11 When a magnetic field is applied to this alloy, the boundaries
between the magnetic domains shift and rotate, and this leads to a change in its dimensions. This
behavior is called magnetostriction, and has many applications in microscale sensors, actuators, and
energy harvesting devices. Desirable properties for such a material include low Young’s modulus,
high magnetostrictive strain, and high yield strength. Theoretical forward models for computing
these properties for a given microstructure are available, but inverse models to obtain microstructures
with the desired properties are very challenging. Note that the approach used in the previous
example (stable compound discovery, where the forward model could scan the entire ternary
composition space to realize the inverse model) would be prohibitive here, since the microstructure
space is too large. In this example, the microstructure was represented by an orientation distribution
function (ODF) with 76 independent nodes, leading to a 76-dimensional inverse problem (each value
representing the positive volume density of a crystal orientation). For a conservative estimate of the
number of possible combinations, even if we assume just two possible values for each dimension,
it would result in the order of 2^76 combinations, which would take more than 2 × 10^9 years to
enumerate, even assuming it takes just 1 μs to compute a property with forward models. Thus, the high
dimensionality of the microstructure space, along with other challenges such as multi-objective design
requirements and non-uniqueness of solutions, makes even the traditional search-based optimization
methods inadequate in terms of both search efficiency and result optimality. In this work, a
machine learning approach to address these challenges was developed, consisting of random data
generation, feature selection, and classification algorithms. The key idea was to prune the search
space as much as possible by search path refinement (ranking the ODF dimensions) and search
region reduction (identifying promising regions within the range of each ODF dimension) so as to
try to reach the optimal solution faster. More details of this approach are available elsewhere.74 It
was found that this data-driven approach for structure-property optimization could not only identify
more optimal microstructures satisfying multiple linear and non-linear properties much faster than
traditional optimization methods, but also discover multiple optimal solutions for some properties
that were unknown for this problem before this study.
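The enumeration estimate quoted above (two values per dimension, 76 dimensions, one microsecond per forward-model evaluation) can be checked with a few lines of arithmetic. The only assumption added here is the length of a year, taken as 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600      # Julian year, about 3.16e7 seconds

n_dimensions = 76                          # independent ODF nodes
values_per_dimension = 2                   # conservative: two values per node
seconds_per_evaluation = 1e-6              # one microsecond per forward-model call

n_combinations = values_per_dimension ** n_dimensions
years = n_combinations * seconds_per_evaluation / SECONDS_PER_YEAR
print(f"{n_combinations:.3e} combinations, about {years:.2e} years")
# prints "7.556e+22 combinations, about 2.39e+09 years"
```

The result, roughly 2.4 × 10^9 years, is consistent with the "more than 2 × 10^9 years" figure in the text, and each additional value per dimension makes the count grow much faster still.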


CONCLUSION 

To summarize and conclude, (big) data driven analytics, which is the fourth paradigm of science,
has given rise to the emergence and popularity of materials informatics, and it is of central
importance in realizing the vision of the Materials Genome Initiative. We discussed a generic workflow
for materials informatics along with three recent illustrative applications where data-driven
analytics has been successfully used to learn invertible PSPP relationships. Currently, the field of
materials informatics is still in its nascent stage, much like what bioinformatics was
20 years ago. Interdisciplinary collaborations bringing together expertise from materials science and
computer science, and creating a workforce equipped with such interdisciplinary skills, are vital to
optimally harness the wealth of opportunities becoming available in this arena, and enable timely
discovery and deployment of advanced materials for the benefit of mankind.